Comprehensive computational analysis of epigenetic descriptors affecting CRISPR-Cas9 off-target activity

Background A common issue in CRISPR-Cas9 genome editing is off-target activity, which prevents the widespread use of CRISPR-Cas9 in medical applications. Among other factors, primary chromatin structure and epigenetics may influence off-target activity. Methods In this work, we utilize crisprSQL, an off-target database, to analyze the effect of 19 epigenetic descriptors on CRISPR-Cas9 off-target activity. Termed as 19 epigenetic features/scores, they consist of 6 experimental epigenetic and 13 computed nucleosome organization-related features. In terms of novel features, 15 of the epigenetic scores are newly considered. The 15 newly considered scores consist of 13 freshly computed nucleosome occupancy/positioning scores and 2 experimental features (MNase and DRIP). The other 4 existing scores are experimental features (CTCF, DNase I, H3K4me3, RRBS) commonly used in deep learning models for off-target activity prediction. For data curation, MNase was aggregated from existing experimental nucleosome occupancy data. Based on the sequence context information available in crisprSQL, we also computed nucleosome occupancy/positioning scores for off-target sites. Results To investigate the relationship between the 19 epigenetic features and off-target activity, we first conducted Spearman and Pearson correlation analysis. Such analysis shows that some computed scores derived from training-based models and training-free algorithms outperform all experimental epigenetic features. Next, we evaluated the contribution of all epigenetic features in two successful machine/deep learning models which predict off-target activity. We found that some computed scores, unlike all 6 experimental features, significantly contribute to the predictions of both models. As a practical research contribution, we make the off-target dataset containing all 19 epigenetic features available to the research community. Conclusions Our comprehensive computational analysis helps the CRISPR-Cas9 community better understand the relationship between epigenetic features and CRISPR-Cas9 off-target activity. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-09012-7.


Supplementary Tables and Figures
Extending Figure 1, Supplementary Table 1 lists the off-target cleavage activity Spearman and Pearson correlation values for all experimental epigenetic and computed nucleosome organization-related features. Figure 1 shows the architecture of the convolutional neural network used for CRISPR-Cas9 off-target activity prediction and SHAP value analysis. Extending Figure 1, Supplementary Figures 2, 3 and 4 show the cell line-based heatmaps indicating Spearman and Pearson correlations between the epigenetic features and CRISPR-Cas9 off-target activity for cell lines HeLa, K562 and U2OS, respectively. Extending Figure 2, Supplementary Figures 5 and 6, respectively, show the violin and distribution plots for CRISPR-Cas9 off-target cleavage activity for all 19 experimental epigenetic features and computed nucleosome organization-related, with the experimental epigenetic features highlighted in bold. Extending Figures 3 and 4, Supplementary Figures 7 and 8 visualize the SHAP contribution of each input feature in a trained XGBoost and CNN model, respectively, both of which predict CRISPR-Cas9 off-target activity, where all computed nucleosome organization-related scores are base pair-resolved.
Using only the 'on-target' datapoints that correspond to guide-RNA-on-target DNA sequence pairs, Supplementary Figures 10 shows an overall correlation analysis. It can be seen that Nucleotide (and Strong-Weak) BDM still show the highest Spearman correlation with on-target cleavage activity, even though the difference in correlation values is not as pronounced as found for the off-target cleavage activity dataset. Supplementary Figure 1: Convolutional neural network architecture used for CRISPR-Cas9 off-target activity prediction as mentioned in the Methods section. The architecture is implemented in PyTorch [1]. The input to the neural network is a 23bp × 22 features input matrix, and the output is a scalar value indicative of the CRISPR-Cas9 off-target activity prediction. The architecture consists of five one-dimensional convolutional (Conv1D) layers followed by one fully connected layer. The first layer is a Conv1D layer with 32 channels, 3 × 3 kernel size, stride of 2 and padding of 0, followed by leaky rectified linear unit activation (LeakyReLU) [2] with a negative slope of 0.2. The second layer is a Conv1D layer with 64 channels, 3 × 3 kernel size, stride of 1 and padding of 0, followed by LeakyReLU [2] with a negative slope of 0.2. The third layer is a Conv1D layer with 128 channels, 3 × 3 kernel size, stride of 2 and padding of 0, followed by 1D batch normalization and subsequently LeakyReLU [2] with a negative slope of 0.2. The fourth layer is a Conv1D layer with 256 channels, 3 × 3 kernel size, stride of 1 and padding of 0, followed by 1D 3 × 3 max pooling with padding of 1 and stride of 1, and subsequently rectified linear unit activation (ReLU). The fifth layer is a Conv1D layer with 512 channels, 2 × 2 kernel size, stride of 1 and padding of 0, followed by 1D 3 × 3 max pooling with padding of 1 and stride of 1, and subsequently rectified linear unit activation (ReLU). The final layer is a fully connected layer which outputs a scalar value.     , and the computed nucleosome organization-related scores GC147 [3], W/S scheme, YR scheme [4,5], Strong-Weak BDM, Nucleotide BDM [6,7], NuPoP (Occupancy), NuPoP (Affinity), NuPoP (Viterbi) [8], nuCpos (Occupancy), nuCpos (Affinity), nuCpos (Viterbi) [9], VanDerHeijden [10] and LeNup (H3Q85C) [11], with the computed scores sorted by decreasing SHAP value importance as shown in Figure 3. , and the computed nucleosome organization-related scores GC147 [3], W/S scheme, YR scheme [4,5], Strong-Weak BDM, Nucleotide BDM [6,7], NuPoP (Occupancy), NuPoP (Affinity), NuPoP (Viterbi) [8], nuCpos (Occupancy), nuCpos (Affinity), nuCpos (Viterbi) [9], VanDerHeijden [10] and LeNup (H3Q85C) [11], with the computed scores sorted by decreasing SHAP value importance as shown in

Nucleosome Organization-Related Tools
This section provides additional background on some of the nucleosome organization-related tools used in this study.

GC Content/GC147
Mathematically, the GC content of a 147bp nucleosomal sequence s is defined by GC content significantly correlates with in vitro nucleosome occupancy in budding yeast [12,3]. Specifically, it was shown that GC content was the dominant feature in a linear model of nucleosome occupancy based on GC content and 14 other DNA sequence-related input features [3]. Nonetheless, GC content is not sufficient for in vitro nucleosome occupancy prediction, since it is not indicative of nucleosome occupancy levels in GC-rich regions in vitro [13].

W/S and YR Schemes
The W/S scheme is based on the well-established DNA sequence pattern where weak-weak (WW) and strong-strong (SS) dinucleotides are periodically located on the histone octamer-facing minor and major grooves, respectively. Mathematically, W/S scheme is defined as WSScore(s) = The YR scheme is based on a weighted sum of GC, YR, YYRR and RYRY counts in the different sites. Further detailed descriptions on how the YR scheme predicts translational positioning can be found in [4] and [5].

Van Der Heijden Algorithm
Based on the dinucleotide wedge model [14], the likelihood ratio for each base pair position is given by where S is the sequence with |S| ≈ 147 centered on the base position. In addition, the position-dependent dinucleotide probabilities are defined by where B and p are the amplitude and period of the dinucleotide frequencies, respectively. Using the likelihood ratios, an energy landscape can be derived. We can then apply the algorithm required for solving Percus's equation [15] in order to generate the nucleosome positioning scores. Nucleosome occupancy values can then be obtained by applying a convolution operation with a 147bp uniform filter. To determine the algorithm's hyperparameters, a Levenberg-Marquadt routine [16] can be used for fitting periodicity p and chemical potential µ to experimental data. In particular, µ is a hyperparameter used when computing the solution to Percus's equation. More details on VanDerHeijden can be found in [10].

Block Decomposition Method-based Measures
Block Decomposition Method (BDM) is a training-free method for approximating the algorithmic complexity of sequences. Mathematically, BDM is founded on the Coding theorem method [17,18], which relates algorithmic (Kolmogorov-Chaitin) complexity [19] with algorithmic probability [20]. Specifically, BDM approximates algorithmic complexity and Shannon entropy for short and long sequences, respectively [6]. Since DNA sequences can easily be represented as a string, BDM scores can readily be computed for DNA sequences.

NuPoP
NuPoP uses a dHMM and a Hidden Markov Model (HMM) for modelling 147bp nucleosomal and linker DNA sequences, respectively. Training data for both models consist of yeast nucleosomal and non-nucleosomal sequences derived from MNase-seq. Both models are used for computing log likelihood ratios, which can be seen as histone binding affinity (HBA) scores. Computationally, the HBA score at position i is given by log P N (Si) G L (Si) where S i is the 147bp sequence centered at position i. P N and G L indicate the probability that the S i is a nucleosomal and linker sequence, respectively. Since linker sequences cannot be too long, NuPoP sets a maximum linker sequence length to 500bp for the dHMM. Using the HBA scores, the forward and backward algorithms can then be used for computing the nucleosome occupancy scores. A Viterbi score can also be computed, which predicts whether a specified nucleotide is located in nucleosomal or linker DNA. More details on the algorithm can be found in [8].